Compliance Conundrums: Implementing PREMIS at two National Libraries

Author

  • Haliza Jailani
Abstract

This paper examines compliance with PREMIS at the National Library Board, Singapore and the National Library of New Zealand. It looks in detail at how the development process, variation in content types, existing embedded technologies, and current knowledge all influence the shape of the preservation metadata that is created, stored and used in a digital preservation system.

Introduction

PREMIS (Preservation Metadata: Implementation Strategies) is the de facto standard for digital preservation metadata. With a clearly defined scope, it details a large part of the metadata required to manage digital objects across time. This paper uses the experience at the National Library Board, Singapore and the National Library of New Zealand to discuss what it means to be compliant with the PREMIS Data Dictionary. The Data Dictionary is well written and conveys complex issues in a relatively concise and simple manner. However, this does not necessarily protect against misunderstandings, and it certainly does not guarantee that the Dictionary will not be reinterpreted, deliberately or otherwise.

Standards

Standards serve multifarious purposes. For digital preservation, where well-documented, consistent actions undertaken on fully described and identified content are the cornerstones of success, standards are critical. Briefly, they offer:

  • Consistency: standards allow implementers to use homogeneous metadata to manage their content.
  • Consensus: standards are created by experts in the field. Implementers benefit from agreement on best practices.
  • Sharing: crucial in the worst-case scenarios, where another organisation must take custody of the content. Sharing content can also aid in mitigating risks.
  • History (or perhaps better expressed as ‘Memory’): standards should document why they are being used and the meaning behind their use. This allows future users to understand what was being done and how they can interpret it.

The digital preservation community consistently refers to a number of touchstones. While a variety of standards and frameworks are often invoked (e.g., Trusted Digital Repository, METS), the two main metaphorical pieces of jasper used to test value are the OAIS model and PREMIS. This paper is concerned with conformance to the PREMIS Data Dictionary.

Compliance

Complying with standards in the heritage sector is a matter of institutional rigour driven primarily by perceived benefit rather than audited necessity. That is to say, there are no fiscal ramifications for erroneously asserting compliance (of course, there may be other ramifications, e.g. reputational risk). The benefits of compliance must therefore be sufficiently strong. The task for implementers (potential and definite) is then to judge the benefits against any barriers to conformance.

PREMIS conformance

The PREMIS Committee’s statement on conformance suggests that the levels are “lightweight, and considerable scope for flexibility and choice is reserved for implementing repositories” [1]. There is perhaps some dissonance here with a statement in the Dictionary that it lists ‘“implementable metadata”: rigorously defined’ [2]. With such rigour, a more stringent conformance level would be expected. In terms of the benefits that should drive conformance, the Committee’s list includes inter-repository data exchange, certification, shared registries, automation, and vendor support [3]. This list is, unsurprisingly, mostly in accord with the benefits listed above.

Institutional Background

Both organisations use the Rosetta digital preservation system, which was developed by Ex Libris in conjunction with the National Library of New Zealand (NLNZ). While care will be taken to clarify where implementation choices have been made by the institutions and where decisions have been made by the vendor, the development role played by NLNZ means that this boundary is not fully demarcated.

The National Library of New Zealand has a legal mandate to collect and preserve documentary heritage and taonga (treasured items) for all people of New Zealand [4]. New Zealand legislation explicitly includes digital content as falling under this mandate. In practice, the Library collects and receives content in variegated formats.

The National Library Board, Singapore (NLB) is mandated to preserve the published heritage of the nation through legal deposit. By law, every Singapore publisher must deposit copies of every publication published in the Republic with the NLB [5]. The Board has undertaken the task of preserving this collection as part of its responsibility. A review conducted by the Board in 2005 recommended the building of infrastructure and a centralised database for the preservation of, and access to, Legal Deposit materials and the wider ambit of national heritage materials [6]. To this end, the Rosetta system was implemented.

Footnote 1: For good or ill, we use ‘compliance’ and ‘conformance’ interchangeably in this paper.

©2012 Society for Imaging Science and Technology

PREMIS Implementation

As stated above, both organisations implement the Ex Libris Rosetta system, and their implementation of PREMIS is through Rosetta. Rosetta undertakes various processes on content as it is ingested, adding metadata to the intellectual entity. Put simply, the end result is that the content files are placed in permanent storage along with a METS file, which contains the metadata that both organisations have deemed to be required for permanent preservation of the content. This METS file contains, though not exclusively, data expressed in a schema called “the DNX”. This in turn contains, again not exclusively, the PREMIS Data Dictionary.

Figure 1. Structure of Rosetta METS file

Examining conformance

Both Libraries compared the data contained within their METS files against the PREMIS Data Dictionary. This comparison included checking:

  1. The nomenclature used;
  2. The semantics of this nomenclature;
  3. The sub-level of object that the units are used at;
  4. The obligation associated with the units; and,
  5. The repeatability of the units.

The following sections highlight one or two key points in each of the five areas of comparison.

Nomenclature and Semantics

With the exception of objectCharacteristicsExtension, which is optional, the semantic components of PREMIS objectCharacteristics are present in Rosetta’s DNX, albeit with some name variations. They are mostly captured under a DNX element called generalFileCharacteristics. PREMIS format and fixity semantic components are captured separately under the DNX fileFormat and fileFixity container units respectively. These containers group the capture of granular details, such as the different values for the semantic components of the fixity types MD5, SHA1 and CRC32, in a more user-friendly way.

Of note, however, is the DNX metadata element objectCharacteristics, which does not share the definition of PREMIS objectCharacteristics: its sub-units relate to metadata such as objectType, parentID, groupID, creationDate, createdBy, modificationDate, modifiedBy and owner. The DNX objectCharacteristics is not used for format-specific technical metadata as defined in PREMIS. In addition, this metadata element applies at the representation, file and bitstream levels, while PREMIS objectCharacteristics is applicable only to file and bitstream. Although confusing at first, its use was adopted as there was no adverse impact on internal operations. External conformity might be an issue, although it is possible that this container unit might be excluded when extracting PREMIS-conformant information from Rosetta for another repository. Another non-conformance is storageMedium, specified by PREMIS to be applicable only at the file and bitstream levels.
Rosetta defines this semantically in physicalCarrierMedia under generalRepCharacteristics, a semantic container at the representation level.

In NLNZ and NLB’s business-as-usual routines, the outputs of our metadata extractors are mapped directly to the significantProperties section in the DNX (significantPropertiesType, Value and Extension). In essence, it is the dumping ground for the technical properties as they are culled from the file. To conform, this information should be placed in the objectCharacteristicsExtension section, which is meant for additional object characteristics from format-specific technical metadata schemas such as Z39.87-2006. The significantProperties container should be used to store properties “determined to be important to maintain through preservation actions” [7]. It is intended to enable flexibility for implementers but has caused inconsistency instead. Notably, the inclusion of external format-specific technical metadata is more easily done in PREMIS significantProperties than in PREMIS objectCharacteristicsExtension, where explicit associations will require repeating the entire semantic unit and where it is recommended that information about the external metadata be provided.

There is a deeper discussion to be had, however. NLNZ and NLB believe all technical properties to be important, irrespective of whether or not they should remain across an action. Some properties we may actually want to take deliberate action to remove from the file. These properties are significant and must be tracked across actions. This does not detract from the nonconformance; however, it does raise questions about the purpose of the significant properties section in PREMIS.

Event Entity

PREMIS event information is recorded at file level in Rosetta, with all semantic units present except for linkingObjectIdentifier.
In addition, event outcomes are detailed separately under the DNX vsOutcome element, with sub-units such as checkDate, agent, type, result, resultDetails, vsEvaluation and vsEvaluationDetails. These record information for specific events such as validation of checksum, file format and technical metadata, virus checks and risk analysis, with the clear intention of ensuring clarity in detailing these checks. It is not non-compliant for a repository to capture more detailed information for a PREMIS semantic unit than is defined in the Data Dictionary. This type of flexibility allows for a sufficient level of consistency and encapsulates the “implementable metadata” that is the intention of PREMIS. It has given both libraries considerable leeway in capturing additional required information.

Footnote 2: We are keenly aware that there is general agreement across the digital preservation (DP) literature with the PREMIS description. See, for example, [8].

Archiving 2012 Final Program and Proceedings

Agent Entity

As would be expected of an implementation system, PREMIS agentIdentifier and agentName are implemented in Rosetta as metadata associated with individual events such as file fixity, virus check, file format, checksum and techMD outcomes. Other optional semantic components for this entity, such as agentType and agentNote, are not explicitly defined in Rosetta’s user interface.

Rights Entity

Rights semantic units are all optional at the container level. Except for access rights, these are not explicitly defined in Rosetta’s user interface, although PREMIS indicates that the minimum rights information a repository should know is its rights to carry out preservation actions.

Object level data

PREMIS describes three levels of object: ‘representation’, ‘file’ and ‘bitstream’. In addition to these three, NLNZ and NLB both use the level of Intellectual Entity (IE) as the primary unit of understanding digital content.
The PREMIS Editorial Committee has stated that it is looking at the level of IE for the next version. This promises to be a strong addition to the dictionary.

Examination of the bitstream level raises some important questions. This is an area that was given a good deal of attention during development of the system, but it still remains a little ‘fuzzy’. That is to say, there is a large legacy of diagrams, papers and requirements that try to finesse how bitstreams should be dealt with. However, a number of factors led to this part of the system being less well resolved than the rest of it. Where is the boundary between bitstream and file? Without this boundary it is very hard to define exactly what the required functionality is. Without an exact requirement, its importance is questioned. This in turn means that if the requirements cannot be rigorously defended, then there is no strong driver (or will) to conform. This is discussed in more detail below.

Obligation

In Rosetta, obligation (the quality of being mandatory or not) denotes that a value is required to aid in processing the object through any number of its functions. It could be argued that this is a valid interpretation of the PREMIS definition, which states “A mandatory semantic unit is something the preservation repository needs to know” [9]. Regardless of how obligation is interpreted, there are some differences. For example, it is clear to both NLNZ and NLB that fixity is a mandatory piece of information: it is a basic unit for tracking integrity and must be included with all files. It is only optional in PREMIS. PREMIS does have the correct sentiment: “Objects that lack these features [fixity, integrity, and authenticity] are of little value to repositories that have a mission to protect evidentiary value or indeed to preserve the cultural memory” [10]; but it does not make fixity a mandatory element.
This deviation does not make the implementation nonconformant, as obligation can be made more stringent without affecting conformance. But it does raise the question of why this specific unit is optional in PREMIS.

Repeatability

In terms of repeatability, one of the more interesting differences is with format identification. PREMIS allows for repeatability of the format container. This means that multiple formats can be recorded against an object. We require, however, that each format coming into Rosetta is given a primary identification. This definite identification is the major driver of search, risk analysis, and preservation planning functionality. Multiple IDs with the same importance would impede this process.

This is not to say that we store only one format identification. We collect and manage format identification from DROID, JHOVE, NLNZ MET, and the internal format library. But it does mean that we need to deviate from the Data Dictionary in order to be able to a) identify the definitive format identification, and b) capture the variety of format information we collect. This information is displayed in Table 1 below. Table 2 shows how this information could be presented in PREMIS.

Across the two tables, the example is of a TIFF file being identified. Until recently, DROID identified TIFF files with multiple identifiers. So in the PREMIS example, all the identifiers could be put into PREMIS. But it allows us no concept of primacy, and it also does not give us the other details that Table 1 does. For example, in Rosetta we also capture format information from the MD extraction process (in this case, JHOVE suggests that the file is TIFF version 5). Crucially, though, in Table 1 the ID value that is used by the system is the formatLibraryID. In Table 2, there is no clear field that would be used.
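
The primacy requirement described above can be illustrated with a short Python sketch. It is not Rosetta's implementation: the `FormatIdentification` record, its `primary` flag and the `definitive_format` helper are hypothetical names introduced here, and the identifier values are illustrative rather than taken from a real record. The sketch only shows the idea that several identifications may be stored while exactly one is marked as definitive.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FormatIdentification:
    """One format identification for a file (hypothetical structure)."""
    source: str            # tool or registry that produced the identification
    identifier: str        # identifier value reported by that source
    primary: bool = False  # assumed flag marking the definitive identification


def definitive_format(identifications: List[FormatIdentification]) -> FormatIdentification:
    """Return the single identification flagged as primary.

    Raises ValueError when zero or several identifications are flagged,
    since search, risk analysis and planning need exactly one.
    """
    primaries = [i for i in identifications if i.primary]
    if len(primaries) != 1:
        raise ValueError(
            f"expected exactly one primary identification, found {len(primaries)}"
        )
    return primaries[0]


# Illustrative values only; not copied from a real DNX record.
tiff_ids = [
    FormatIdentification("DROID", "fmt/7"),
    FormatIdentification("DROID", "fmt/8"),
    FormatIdentification("JHOVE", "TIFF version 5"),
    FormatIdentification("format library", "fmt/7", primary=True),
]

print(definitive_format(tiff_ids).identifier)
```

A list of repeated PREMIS format containers corresponds to `tiff_ids` without the `primary` flag, which is why no single entry can be selected from it without some additional convention.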
The purpose of Tables 1 and 2 is not to describe in detail the flows that lead to a given result in each unit (for Rosetta, these flows are complex and require more space than is available to describe), but rather to show that while PREMIS allows repeatability of the format container, it does not allow us to specify which container is to be used as the definitive format identification.

Footnote 3: DROID has recently added a new classifier for TIFFs that is a generic container for these multiple hits.

Table 1: Format Identification Information in DNX

NLNZ/NLB Unit                                 Value        Description
generalFileCharacteristics.fileMIMEType       audio/tiff   From Format Library
generalFileCharacteristics.formatLibraryID    Fmt/7        Definitive Format
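
The comparison described under Examining conformance (nomenclature, object level, obligation and repeatability) can also be sketched in code. This is a minimal illustration under stated assumptions: `UnitDefinition` and `compare` are hypothetical names, the obligation and repeatability values in the examples are assumptions for demonstration rather than quotations from the Data Dictionary or the DNX, and semantics are left out because they require human judgement. Following the paper, tightening an obligation from optional to mandatory is not reported as a deviation.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class UnitDefinition:
    """Simplified description of a semantic unit (hypothetical structure)."""
    name: str
    levels: FrozenSet[str]  # object levels at which the unit is used
    mandatory: bool
    repeatable: bool


def compare(standard: UnitDefinition, local: UnitDefinition) -> List[str]:
    """List the ways a local unit deviates from the standard's definition."""
    findings = []
    if local.name != standard.name:
        findings.append(f"name differs: {local.name} vs {standard.name}")
    extra_levels = local.levels - standard.levels
    if extra_levels:
        findings.append("used at extra levels: " + ", ".join(sorted(extra_levels)))
    # Only loosening an obligation counts; stricter local rules are allowed.
    if standard.mandatory and not local.mandatory:
        findings.append("obligation loosened")
    if local.repeatable != standard.repeatable:
        findings.append("repeatability differs")
    return findings


# Assumed values for illustration only.
premis_fixity = UnitDefinition("fixity", frozenset({"file", "bitstream"}), False, True)
local_fixity = UnitDefinition("fileFixity", frozenset({"file"}), True, True)

premis_storage = UnitDefinition("storageMedium", frozenset({"file", "bitstream"}), True, False)
local_storage = UnitDefinition("physicalCarrierMedia", frozenset({"representation"}), True, False)

print(compare(premis_fixity, local_fixity))
print(compare(premis_storage, local_storage))
```

In this sketch the fixity example yields only a naming difference, while the storage example also reports use at the representation level, mirroring the two cases discussed in the text.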

Publication date: 2012